Crowdsourcing Microdata for Cost-Effective and Reliable Lexicography
Abstract
Lexicography has long faced the challenge of having too few specialists to document too many words in too many languages with too many linguistic features. Great dictionaries are invariably the product of many person-years of labor, whether the lifetime work of an individual or the lengthy collaboration of a team. Is it possible to use public contributions to vastly reduce the time and cost of producing a dictionary while ensuring high quality? Crowdsourcing, often seen as the solution for large-scale data acquisition or analysis, is fraught with problems in the context of lexicography. Language is not binary, so there may be no one right answer to say that a word “means” a particular definition, or that a word in one language “is” the same as a particular translation term. People may misinterpret instructions or misread terms or make typographical or conceptual errors. Some crowd members intentionally add bad data. Without a payment system, incentives for participation are slim; micro-payments introduce the incentive to maximize income over quality. Our project introduces a public interface that breaks lexicographic data collection into targeted microtasks, within a stimulating game environment on Facebook, phones, and the web. Players earn points for answers that win consensus. Validation is achieved by redundancy, while malicious users are detected through persistent deviations. Data can be collected for any language, in an integrated multilingual framework focused on the serial production of monolingual dictionaries linked at the concept level. Questions are sequential, first eliciting a lemma, then a definition, then other information, according to a prioritized concept list. The method can also be used to merge existing data sets. Intensive trials are currently underway in Vietnamese, with the inclusion of additional Asian languages an explicit objective.
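The validation idea described above (points for answers that win consensus, acceptance through redundancy, and detection of malicious users through persistent deviation) can be illustrated with a minimal sketch. This is not the project's actual algorithm: the function names, the thresholds (`MIN_VOTES`, `CONSENSUS_SHARE`, `MAX_DEVIATION_RATE`, `MIN_JUDGED`), and the sample answers are all hypothetical, chosen only to make the mechanism concrete.

```python
from collections import Counter, defaultdict

# Hypothetical thresholds, not taken from the paper.
MIN_VOTES = 3             # redundancy: answers required before a task is judged
CONSENSUS_SHARE = 0.6     # share of answers the top answer must reach
MAX_DEVIATION_RATE = 0.5  # deviation rate above which a player is flagged
MIN_JUDGED = 3            # judged answers required before a player can be flagged


def consensus(answers):
    """Return the winning answer for one microtask, or None if no consensus.

    answers maps player id -> submitted answer.
    """
    if len(answers) < MIN_VOTES:
        return None
    top_answer, top_count = Counter(answers.values()).most_common(1)[0]
    if top_count / len(answers) >= CONSENSUS_SHARE:
        return top_answer
    return None


def score_round(tasks):
    """Judge every task; return (accepted answers, points, flagged players).

    tasks maps task id -> {player id: answer}.
    """
    accepted = {}
    points = defaultdict(int)
    judged = defaultdict(int)
    deviations = defaultdict(int)

    for task_id, answers in tasks.items():
        winner = consensus(answers)
        if winner is None:
            continue  # no consensus yet: nobody is rewarded or penalized
        accepted[task_id] = winner
        for player, answer in answers.items():
            judged[player] += 1
            if answer == winner:
                points[player] += 1
            else:
                deviations[player] += 1

    # Flag only players who deviate persistently, not after a single miss.
    flagged = {
        player
        for player in judged
        if judged[player] >= MIN_JUDGED
        and deviations[player] / judged[player] > MAX_DEVIATION_RATE
    }
    return accepted, dict(points), flagged


# Made-up sample round: four players answer four lemma-elicitation tasks.
tasks = {
    "lemma:water": {"an": "nước", "binh": "nước", "chi": "nước", "troll": "xyz"},
    "lemma:fire": {"an": "lửa", "binh": "lửa", "chi": "lửa", "troll": "abc"},
    "lemma:tree": {"an": "cây", "binh": "cây", "chi": "cây", "troll": "qqq"},
    "lemma:run": {"an": "chạy", "binh": "đi", "chi": "chạy", "troll": "zzz"},
}
accepted, points, flagged = score_round(tasks)
```

In this sample, three tasks reach consensus and one ("lemma:run") does not, so it contributes neither points nor deviations. Requiring a minimum number of judged answers before flagging reflects the abstract's point that language is not binary: a single disagreement with the crowd may be a legitimate alternative rather than bad data.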